Using Background Knowledge to Improve Inductive Learning of DNA Sequences
Abstract
Successful inductive learning requires that training data be expressed in a form where underlying regularities can be recognized by the learning system. Unfortunately, many applications of inductive learning, especially in the domain of molecular biology, have assumed that data are provided in a form already suitable for learning, whether or not such an assumption is actually justified. This paper describes the use of background knowledge of molecular biology to re-express data into a form more appropriate for learning. Our results show dramatic improvements in classification accuracy for two very different classes of DNA sequences using traditional "off-the-shelf" decision-tree and neural-network inductive-learning methods.
The goal of inductive learning is to take a collection of labeled "training" data and form a classifier that accurately predicts the labels of future data. To be successful, it is essential that the data be expressed in a form where the learner can recognize underlying regularities. This problem of finding an appropriate encoding for training data is widely recognized as being central to all inductive learning efforts. However, most work in inductive learning has traditionally assumed that the data are initially provided in a form that is already appropriate for learning (whether or not such an assumption is actually justified). In the notable exceptions where such an assumption is not made, the most common approach has been to create a set of features carefully engineered for the particular learning problem. For example, Quinlan [19] documents his two-person-month effort defining an appropriate set of features for learning the chess end-game of lost 3-ply with a decision-tree learner. Similarly, Kamm and Singhal [10] describe the process by which different methods for sampling a speech signal were explored to find the one that yields the best encoding of data for learning by a neural-network learner.
The representation of training data is particularly important when learning classifiers for DNA sequences. In this domain, training data are usually represented as strings over a four-letter alphabet; each character represents a nucleotide, which is the fundamental unit of the DNA polymer. All past learning efforts have attempted to learn directly from these raw data. However, this representation expresses only low-level information about nucleic-acid sequences and does not encode any of the important features that biologists find useful in describing regions of DNA. We began with the assumption that the use of an appropriate …
We thank Steve Norton and Lorien Pratt for helpful discussions and comments.
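To make the contrast between raw and re-expressed data concrete, the following is a minimal sketch, not the authors' encoding. It assumes a one-hot scheme for the raw four-letter representation, and it uses GC content purely as a hypothetical example of a higher-level feature; the excerpt above does not say which features the paper actually derives. The function names `one_hot_encode` and `gc_content` are illustrative.

```python
# Illustrative sketch only: the encodings below are assumptions,
# not the features used in the paper.

NUCLEOTIDES = "ACGT"  # the four-letter DNA alphabet


def one_hot_encode(sequence):
    """Raw, position-by-position encoding: four binary inputs per nucleotide."""
    vector = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected character: {base!r}")
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


def gc_content(sequence):
    """Hypothetical higher-level feature: fraction of G or C nucleotides."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


raw_features = one_hot_encode("GATC")   # low-level input for a learner
summary_feature = gc_content("GATC")    # one re-expressed attribute
```

The raw encoding grows with sequence length and ties every input to an absolute position, whereas a derived attribute summarizes a property of the whole region; which derived attributes are worth computing is exactly where domain knowledge would enter.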
Similar resources
Recursive Feature Generation for Knowledge-based Learning
When humans perform inductive learning, they often enhance the process with background knowledge. With the increasing availability of well-formed collaborative knowledge bases, the performance of learning algorithms could be significantly enhanced if a way were found to exploit these knowledge bases. In this work, we present a novel algorithm for injecting external knowledge into induction algo...
Incremental Learning in Inductive Programming
Inductive programming systems characteristically exhibit an exponential explosion in search time as one increases the size of the programs to be generated. As a way of overcoming this, we introduce incremental learning, a process in which an inductive programming system automatically modifies its inductive bias towards some domain through solving a sequence of gradually more difficult problems ...
Inductive vs. Deductive Grammar Instruction and the Grammatical Performance of EFL Learners
Learning a foreign language offers a great challenge to students since it involves learning different skills and subskills. A considerable amount of research has been done so far on the relationship between gender and learning a foreign language. On the other hand, two major approaches to teaching grammar have been offered by language experts: inductive and deductive. The present study examines...
DNA Sequence Classification Using Compression-Based Induction
Inductive learning methods, such as neural networks and decision trees, have become a popular approach to developing DNA sequence identification tools. Such methods attempt to form models of a collection of training data that can be used to predict future data accurately. The common approach to using such methods on DNA sequence identification problems forms models that depend on the absolute loc...
Incorporating Prior Domain Knowledge Into Inductive Supervised Machine Learning Incorporating Prior Domain Knowledge Into Inductive Machine Learning
The paper reviews the recent developments of incorporating prior domain knowledge into inductive machine learning, and proposes a guideline that incorporates prior domain knowledge in three key issues of inductive machine learning algorithms: consistency, generalization and convergence. With respect to each issue, this paper gives some approaches to improve the performance of the inductive mach...